Average word length | # of sentences | Source |
---|---|---|
5.96 | 10 | Ä |
6.01 | 11 | Stel hiela Van Maanen |
6.08 | 11 | Vödem ela Babel |
6.22 | 12 | Vol |
6.24 | 23 | Proxima Centauri |
6.30 | 28 | Ninon de l'Enclos |
6.33 | 14 | Niels Henrik Abel |
6.34 | 39 | Epsilon Eridani |
6.38 | 25 | Frederik van Eeden |
6.38 | 21 | Ludwig van Beethoven |
6.38 | 18 | Karla LaVey |
6.41 | 37 | Anaïs Nin |
6.41 | 21 | Tau Ceti |
6.41 | 18 | Filippus Johann Krüger |
6.44 | 15 | Ismail Kadare |
6.45 | 21 | Jacques-Henri Bernardin de Saint-Pierre |
6.46 | 91 | Man di Piltdown |
6.46 | 11 | Elvis Presley |
6.46 | 10 | St. Petersburg (Florida) |
6.47 | 10 | St. Paul (Nebraska) |
6.49 | 17 | Edgar Allan Poe |
6.49 | 15 | Brian Reynold Bishop |
6.50 | 12 | Albert Camus |
6.51 | 15 | Anne de Vries |
6.52 | 27 | Charles Ezra Sprague |
6.52 | 12 | Jimi Hendrix |
6.53 | 12 | Leopold Einstein |
6.54 | 18 | Edmontonia |
6.54 | 11 | Anton Cehov |
6.54 | 10 | St. Edward |

Average word length | # of sentences | Source |
---|---|---|
8.35 | 18 | Volapükagasedem |
7.49 | 18 | Dadeadamajenot ün Lordovig-Silur |
7.45 | 10 | Saurischia |
7.43 | 10 | Masiakasaurus |
7.42 | 11 | Aviatyrannis |
7.40 | 89 | Johann Martin Schleyer |
7.34 | 13 | Genyodectes |
7.33 | 31 | Carnotaurus |
7.29 | 11 | Pachypleurosauridae |
7.28 | 43 | Lordovig |
7.28 | 13 | Felix Hausdorff |
7.28 | 12 | Ichthyosaurus |
7.28 | 10 | Michael Faraday |
7.27 | 19 | Francisco Valdomiro Lorenz |
7.27 | 18 | Heterodontosaurus |
7.27 | 10 | Laurasiyop |
7.26 | 10 | Camptosaurus |
7.25 | 13 | Megalosauroidea |
7.25 | 10 | Epanterias |
7.24 | 17 | Paleozoig |
7.24 | 10 | Nomingia |
7.23 | 17 | Quetzalcoatlus |
7.22 | 22 | Fösilav |
7.21 | 13 | Benjamin Peirce |
7.20 | 24 | Carcharodontosaurus |
7.18 | 18 | Ornithischia |
7.18 | 10 | Dora d'Istria |
7.17 | 10 | Agujaceratops |
7.17 | 10 | Scipionyx |
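7.16 | 25 | Silur |

This subsection addresses the same kind of problem as 6.4.1.1, with similar results, but ranks sources by average word length instead of average sentence length. The first table lists the 30 sources with the lowest average word length, the second the 30 sources with the highest; only sources with at least 10 sentences are considered.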
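The measured average word length depends strongly on tokenization. A typical tokenizer might split the string “28.06.2005” into five tokens, “28 . 06 . 2005”, with an average length of two. To avoid this, the number of words in a sentence is counted as 1 + (number of blanks in the sentence). Note that the query below divides the full sentence length, blanks and punctuation included, by this word count.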
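The effect of the two counting schemes can be illustrated with a minimal Python sketch. The function names and the regular-expression tokenizer are assumptions made for illustration only; the actual tokenizer used for the corpus is not specified here.

```python
import re


def blank_based_word_count(sentence: str) -> int:
    """Count words as 1 + (number of blanks), as described above."""
    return 1 + sentence.count(" ")


def punctuation_splitting_tokens(sentence: str) -> list[str]:
    """Hypothetical tokenizer that separates punctuation from letters and digits."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def mean_length(tokens: list[str]) -> float:
    """Arithmetic mean of the token lengths."""
    return sum(len(t) for t in tokens) / len(tokens)


def chars_per_word(sentence: str) -> float:
    """Sentence length (blanks included) divided by the blank-based word count."""
    return len(sentence) / blank_based_word_count(sentence)


date = "28.06.2005"
tokens = punctuation_splitting_tokens(date)  # five short tokens
print(tokens, mean_length(tokens))

# With blank-based counting the date stays a single word of length 10.
print(blank_based_word_count(date), chars_per_word(date))
```

Under the blank-based scheme, punctuation attached to a word counts toward that word's length, so the resulting values are not directly comparable with averages computed from a tokenized word list.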
select round(avg(length(sentence) / (1 + length(sentence) - length(replace(sentence, " ", "")))), 2) as le,
       count(sentence) as cnt,
       source
from sentences s, inv_so i, sources so
where s.s_id = i.s_id and i.so_id = so.so_id
group by source
having cnt >= 10
order by le
limit 30;
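The query as printed sorts in ascending order and therefore yields the low end of the ranking; the high end presumably comes from the same query with a descending sort. The following sketch runs both variants against a small in-memory SQLite database. Several points are assumptions: the table layouts are inferred from the column names used in the query, the toy sentences and source names are invented, and the SQL is adapted to SQLite (single-quoted string literals, `1.0 *` to force non-integer division, the aggregate repeated in `HAVING`). The original appears to target MySQL, where `length()` counts bytes rather than characters, so results for non-ASCII text may differ.

```python
import sqlite3

# SQLite adaptation of the query above; {direction} is filled with ASC or DESC.
QUERY = """
SELECT round(avg(1.0 * length(sentence)
                 / (1 + length(sentence) - length(replace(sentence, ' ', '')))), 2) AS le,
       count(sentence) AS cnt,
       source
FROM sentences s
JOIN inv_so i ON s.s_id = i.s_id
JOIN sources so ON i.so_id = so.so_id
GROUP BY source
HAVING count(sentence) >= :min_sentences
ORDER BY le {direction}
LIMIT 30
"""


def word_length_ranking(conn, descending=False, min_sentences=10):
    """Return (average length, sentence count, source) rows, lowest or highest first."""
    direction = "DESC" if descending else "ASC"
    return conn.execute(
        QUERY.format(direction=direction), {"min_sentences": min_sentences}
    ).fetchall()


# Toy database; the schema is an assumption inferred from the query's column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentences (s_id INTEGER PRIMARY KEY, sentence TEXT);
CREATE TABLE sources (so_id INTEGER PRIMARY KEY, source TEXT);
CREATE TABLE inv_so (s_id INTEGER, so_id INTEGER);
""")
conn.executemany("INSERT INTO sources VALUES (?, ?)",
                 [(1, "short"), (2, "long"), (3, "tiny")])
conn.executemany("INSERT INTO sentences VALUES (?, ?)",
                 [(1, "a bc"), (2, "de f"), (3, "abcdefgh ij"),
                  (4, "abcdefghij"), (5, "x")])
conn.executemany("INSERT INTO inv_so VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3)])

lowest = word_length_ranking(conn, descending=False, min_sentences=2)
highest = word_length_ranking(conn, descending=True, min_sentences=2)
print(lowest)
print(highest)
```

The `min_sentences` parameter is lowered to 2 only so that the toy data passes the filter; the default of 10 matches the threshold in the original query.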
6.4.2.2 Average logarithmic word rank for different sources
6.4.2.3 Sources consisting of many / few words with frequency 1
6.4.2.4 Sources with low / high average word length of rare words